Loan Data from Prosper

Preliminary Wrangling

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. The dataset contains so-called listings which either have been transformed to a loan or not. Partially funded loans are possible as well. My main overall interest might be why and who is becoming a so-called Prosper borrower and furthermore what is mainly influencing the interest rate. Interesting would be how the average Prosper rate is compared to the normal financial market.

What is the structure of the dataset?

Some Information about This Dataset can be found in my Google Drive link below. (https://drive.google.com/file/d/1SnL9_i9fbUx2M2MYTxwZ5jyLbOeLvQAD/view?usp=sharing). <------

What is/are the main feature(s) of interest in your dataset?

Based on my high level questions I think these are the main attributes:

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I examined the structure of the dataset utilizing the mentioned sources and categorized 3 main areas with the following attributes. The main attributes are referenced as bold below. I assume the other attributes are helping to explain variations and patterns observed in the data. However, they might be not taken into consideration, depending on the analysis.

Gather Data

Univariate Exploration

Some cleaning basic cleaning upfront

Missing Values

Comment: Ok, let's comment and decide what to do?

Ther is fundamaental amount missing for closed date, which I believe is mainly die to the fact that the majority of loans are still running. Furthermore the section at the bottom, introduced bay TotalProspertLoans, TotalProsperPaymentBilled, etc. are aggregations based on historical data, hence if we haev first time lenders, those will be not filled.

Start with the Univariate Analysis ...

ProsperRating (Alpha) / ProsperRating

The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009. 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA.

Comment: As we have 25% not populated because the Prosper Rating started after July 2009 I thought maybe a simple rule (or even regression) for the derivation of the Prosper rating based on the external one would be easy. But it is not as e.g. D is between 680 - 699 and E later down as well. So let's flag them as before_July09 and analyze keeping decide at th end to keep or to get rid of them.

Comment: For further analysis, we need to exclude the records "before_July09" (so let's remove them). The other rating counts are nicely centred around "C" which divides lower risks and higher risks.

ProsperScore

A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.

Comment: ok it's 11 levels as seen in the data and in that link.

Comment: The other scores counts are somehow centred around "6" which divides lower risks and higher risks. Although the 8 and 4 are as well very dominant.

Occupation

The Occupation selected by the Borrower at the time they created the listing.

Comment: Oh my god, seems to be free text entry, actually not that bad. 67 different professions, majority is in Others, btw the missing values (3.1%) we can move to others

Comment: Leaving the occupation euqals to "Other" out gives a good picture on the different professions, however most counted vales are "Other" and "Professional".

EmploymentStatus

The employment status of the borrower at the time they posted the listing.

Comment: Change to catergorial.

Comment: Most values are labeled as "Employed". 2nd highest is "Full-time" which could be as well "Employed". The same applies for "Part-time", where is the real distinguishing to "Employeed". A dimension reduction strategy here is to combine already categories: Employed + Full-time = Employed(full-time)

Comment: Most lenders are "Employed (full-time) only a small proportion is "Self-employed". Even less is the proportion of "Part-time" workers.

IsBorrowerHomeowner

A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.

Comment: Nearly evenly distributed, this is nice and can be analyzed further. In Bivariate e.g. together with income etc...

BorrowerState

The two letters abbreviation of the state of the address of the borrower at the time the Listing was created.

Comments: Ok, California is leading clearly that distribution, might be that all of those "Computer Programmers" are located in Silicon Valley.

ListingCreationDate

The date the listing was created.

Comment: The data typs is string let's covert to date

Comment: There is a clear upward over the years, as 2014 is only available till mid of march it is not really a drop.

LoanOriginationDate

The date the loan was originated.

Comment: Nearly the same as the ListingCreationDate.

ListingCategory

The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

Comment: This is not a surprise, actually in the various articles I read about P2P lending, it is often mentioned that this type of loan is ideal to consolidate different loans including credit card debts. Here we can see that more than the half of lall loans are used for that purpose. Also here merge "Not Available" to Others.

BorrowerAPR

The Borrower's Annual Percentage Rate (APR) for the loan.

Comment: Difficult ... there are a lot of values at the end in the bins of 0,35 - 0,36 (btw. which is more than 30%). However looking to the distribution I would say it can be considered as normal distributed.

BorrowerRate

The Borrower's interest rate for this loan.

Comment: Very similar to the APR which is no surprise as the APR is including the fees and teg the rate, actually everthing the borrower needs to pay for. Actually a bit better then the APR as on there is only one spike far right in the bin of 0,33. This might be related to the spike of the BorrowerAPR.

Term

Comment: 36 month term is clearly dominant, followed by 60 month. 12 month seems to be not that important.

LoanStatus

Comment: Most of the loans are in good shape either "Completed" or "Current", some of the loans are in in a Past Due category indicating that the lender is behind the payment schedule. Finally there is a considerable amount of loans defaulted (comes after the Past Due) and even more are finally Charged Off (appr. 6%). Which indicates the potential risk.

DebtToIncomeRatio

The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%). The lower the better!

Comment: As I don't konw how the Ratio is excatly caluclated I need to exclude the NAs. The 19 records with 0.0 I treat as properly calculated

Comment: This isn't looking normal distributed at all. As it's a financial KPI and the original scale is pretty much right skewed, a log scale might better explain the distribution.

Comment: Better the log transformed scale shows a normal looking distribution.

Comment: The Q-Q Plot and the standard test isn't perfect but the log10 scale plot looks definitely more normal than the original scale.

IncomeRange

The income range of the borrower at the time the listing was created.

Comment: I think it could be better to look after the organic distribution of the income the borrowers have at hand at the time of the listing creation. let's have a look on StatedMonthlyIncome.

StatedMonthlyIncome

The monthly income the borrower stated at the time the listing was created.

Comment: I could can only produce a meaningful plot by filtering below 40k. The data is exremly right skewed and definitely on the original sclae not normal distributed.

Comment: A few outliers are pretty much moving the shape to the right. 943 "Monthly Incomes" are higher than 20000$.

Comment: The Q-Q Plot is not really underlining the normality. However the log transformed plot is really much more following a "bell-shape" then the original scale. Let's check as well closer in the up-coming analysis.

MonthlyLoanPayment

Comment: There are some values here which are extremely low below 10$ and they spread to different categories like completed etc. However out of thise categories the proportion of 0 is very low.

Comment: Not looking normal on the original scale. Outliers are in the bins of 100 and 200

Comment: The log transformation seems to decribe the values pretty well.

Normally we could look how to fill the loan payments, bu as it is small number we can get rid it.

LoanOriginalAmount

The origination amount of the loan.

Seems to be not normal....

Looks a bit better however still not the ideal bell curve or close to it.

Comment: Ok, looking better but still not normal. So I decide to work with the original scale.

PercentFunded

Percent the listing was funded.

Nice sometimes more was funded then requested...

Comment: Most of the times the full amount was paid out.

EstimatedReturn

The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.

Comment: ok the Estimated return is sometimes 0 or negative. Apart from that it looks pretty normal. What about the Loss?

Estimated Loss

LossEstimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.

Comment: ok the Estimated return is sometimes 0 or negative. Apart from that it looks pretty normal. What about the Loss?

Comment: From right to left skewed. We will look in the further analysis and stay with the original scale.

Univariate Exploration Summary

Bivariate Exploration

Now move o to the bivariate exploration, first see if there is a any insight in the following attribute comparisons.

Ok, the correlation matrix and the pairplot show some good candidates for continous comparison. I continue based on th equestion I formulated.

Who is using Prosper?

Analyse who is using prosper, which occupation, which state in combination.

Occupation vs. State

Which Occupations coming from which state?

Comment: As expected, the pivot of state and occupation shows that e.g. that the group "Others" and "Professionals" comimng from California (CA) are the most frequent. We have more computer programmers in California then other states. Let's look if there is an intersting insight for lenders owing a house. Do they own houses?

State vs. Homeowner

Are there states with more homeowners in the Prosper Community?

Comment: The Homeowner counts in Californa(CA) are much lesser than eg. Texas. In Florida is nearly even. The smaller the overall Prosper utilization gets the bigger the count of the homeowners get (with a few exceptions). I'm not an US expert but comparing e.g. California and NY with WA as example is also comparing Metroplitan areas vs. counteyside. Which occupation do Homeowners have?

Occupation vs. Homeowner

Are there occupation groups having more houses?

Comment: Group "Others" have less "Homeowners" than e.g. than "Professionals" the 2nd largest group. Executives tend to have more real estate property as well. Overall we know that the split of Homeowners and Non-Homeowners is nearly even. Which Employement Status do the different occupations have?

EmploymentStatus vs. Occupation

Which Occupations have which Employment Status?

Comment: Biggest group "Others" has most of them as "Employed". The following groups "Professional", "Comp. Progr.", "Executive" and have no (or neglectable) count of not employed or retired. Even part-time proportion is very low. So bottomline the Prosper Community sees to be Full-time employed.

There might be a relationship between state and listing type?

Listing Type vs. State

Comment: Nothing special here, California and Debt Consolidation collects most of the counts. Debt consolidation seems to be leading: Do we have occupation categories doing a particular type of loan?

Occupation vs. Listing Type

Are there proffessions doing a particular ListingCategory more often?

Comment: The Other categories are kind of dominating. Professionals do most Debt Consolidation. Followed by Home Improvement and Business. Not a big inside as we know already that the Consolidtion of loans is the strongest use case. Do homeowners do different things withe the money?

Listing Category vs. Homeowner

Are Homeowners more in the Home Improvement Busienss?

Comment: Debt Consolidation is highest for Homeowner as well. Surprisingly Home Improvement is done by non-Homeowner as well (appr. 2,5%). Ok, by now not too much intersting insights let's explore further some quantitative attributes.

How much money do the different occupations have at hand?

Occupation vs. Monthly Income

Comment: The histograms above are ordered by descending by the sum of avaibale incomme in that group.Interestingly the mean income (red line) has a different order. Let's look to that.

Comment: Ok, doctors have the most available income ...

Comment: However Professionals, Computer Programmers and the very large group Others are still above the average income.

Let's see how the overall debt situation is looking like..

Occupation vs. DebtToIncomeRatio

Let's neglect the outlier for the momnet and have a look to the values up to the 3rd quantile.

Comment: Teachers Aide, Waiters, Students and Fodd Service have the highest ratio. Computer Programmers, Professionals are below the average Ratio, however the large group of Others is slightly above. Whart about the extemes > 0,3

Comment: We have 323 above 1 which means those lenders have earn the same amount of money as they have debts(and evern more). There seems to be a concentration between 1 and 3. I don't consider them as they make only a small propoportion. Let's look to the high ratios.

Comment: Food Service, Sales/Retail, Clerical, Teachers as well as Others, Computer Programmers and Professionals have considerable amount of high ratio. How is the score looking for the occupations?

Occupations vs. LoanAmount

How much money do they borrow on average?

Occupation vs. ProsperScore

First let's compare the 2 main scoring attributes ProsperScore and ProsperRating,

Comment: The darker area somehow shows the relationsip between the 2. However Prosper Rating e.g. of "C" gives a range of from 2-11.

Comment: The majority of the different professions are concentrated in the area from C to A. What is the Score telling us?

Comment: It ranges 2-10, so the extreme 1 and 11 are much lesser seen. Espeically the usual suspects Others, Professionals, Executives , Computer Programmer, Teacher etc. are underlining that.

Summary Who is using Prosper:

In general the occupation group "Other" is dominating the listings, whereas other occupations have a small proportion. This is a petty as it might be due to the fact that users have the possibility to choose from a ddlb the other category.

Let's move over to my 2nd question I had and pull some atributes for bi-variate analysis.

Why is Prosper used?

I had immidiatly 2 things in mind why the borrowers use Prosper instead using a normal bank. As we know now already that Debt Consolidation is the most frequent ListingType so loan type, what makes it so attractive? Is it the time from application to the point of time receiving the money. I call that time to money.

Listing Creation Date vs. Loan Origination Date

Comment: Fast is different, however the platform Prosper seems to accelarate with increasing Listings. There is a clear downward trend in Time2Money attribute, which might attract the borrowers. What else could attract borrowers to use Prosper? What else could be reason to use Prosper? For sure the money you need to pay which can be summarized by the Borrower Annual % Rate.

Listing Creation Date vs. BorrowerAPR

Comment: At the end of 2011 the rates have being considerable higher as before and it seems as this was the end of an upward trend, till the mid of 2014 it was constantly going down. Is there a difference on the Occupation?

BorrowerAPR vs. Occupation

Student from a technical school having the best mean APR followed by Judges, Doctors, Invstors as an example. However one can see that the porportion of those aren't very high. The usual suspects in that dataset (Others, Professionalsetc.) come later. Let's check the distribution of the means and bring it to an order.

the means of the occupations seems to be nicely normal distributed... let's review them

Comment: It's not a bargain, the avg. Borrower APR is appr. 22%. Let's look if there is a relation to the score. However Computer Programmer and Executives are getting far better rates then Professionals and the famous Others.

Summary: Why is Prosper used?

Finally move now to question ....

What is influencing the Rates?

There are some "low Brainers" which we can check first.

BorrowerAPR vs. ProsperRating

Comment: The distributions are nicely climbing with decreasing the rating. Most of the distribution are bi- or multimodal. The spread in each category is as well large. On categories AA we find many outliers to the right. In A as well and C, D have outliers as well in both directions. E has outliers to the left. Again below as an example the density and viloin plot for "E"! HR has a relatively small IQR and many outliers to the left an dto the right. So what else influences the APR?

BorrowerAPR vs. DebtToIncomeRatio

Comment: As we know from the beginning the correlation coeff = 0,128, so we have a weak linear relation. Below 2 the rates are going the full range we need to zoom in a bit. One thing which is looking strange is the nearly vertical line between 0,35 and appr. 0,37 along all DtIR.

Below 1 and 0.5 DTIR there seems to be a concentration on the of rates...let's check quickly in a heatmap.

Comment: The majority of the values is concentrated between 0,1 and 0,4 and 0,10 to 0,25. We can see a slight upward trend in the concentrated area, but the rate also must be influenced by something else. Maybe the term?

BorrowerAPR vs Term

We know from the Univariate Analysis that 12 month term are very seldom bit they have a lower rates as a starting point of their distributions. So 36 seems to be multimodal and 60 right skewed. Still all 3 terms give overall a wide range of rates. In 36 we see the 3rd modality which is looking similar to the concentration we saw in the DTIR between 0,35 and 0,37 let's look closer. The Medians are close to each other, the range for 60 is smaller.

Comment: So this can might explain the peak line we saw in the D2IR I was mentioning before, however either it is a combination of attributes or it is another attribute which is primarily deriving the rate. Let's check the Score.

BorrowerAPR vs. ProsperScore

Comment: We can see that the medians of the rates are increasing by decreasing the score. However 2 things are here interesting.

The groups 11 and 1 are relatively small. So there is still something else which controls the rates.Let's look to the income

BorrowerRate vs. IncomeVerifiable

Comment: The appr. 10% not having a verified income do get higher rates. So having all documents ready helps as well here. What else could influence the rate. Property?

BorrowerAPR vs. IsBorrowerHomeowner

Comment: To own a house definity helps. We can clearly see that the median rate is lower for houseowners. Ok let's finally check here amount requested.

BorrowerAPR vs. StatedMonthyIncome

Comment: we can see on the log scale that the most frequent area is 1900 and above 20000 (I mean it's monthly income). However the range of APR is still going from 5 - above 40%. Having a lot of money/month is not an indication to get better rates.

Comment: We see that the fit isn't really good but it emphasizes a bit the less income the higher the rate. In the heatmap we see a that the range goes from 0,1 to 0,4 through all income levels. Let's revisit the categorial income range.

BorrowerAPR vs. IncomeRange

Comment: This makes it really visible the platform Prosper is used by well earning lenders, the spike between 0.3 and 0.4. Having a higher income definitely helps to get the best rates, however there is much variance in the distributions. The proportion of below 25k and even unemployed is very small.

BorrowerAPR vs. MonthlyLoanPayment

Comment: The monthly payment seems to have a limit at appr. 1000$ this is where below most of the rates are concentrated a slight trend can be seen that lower rates are drawn by higher monthly payments. Also notice the peak between 0.3 and 0.4 and 200!

BorrowerAPR vs. LoanOriginalAmount

Comment: The higher the loan the spread of rates decreases. pls. note the peak between 0.3. and 0.4 till 5000! What else can influence the rate, yes the estimate of loss and return for the lender.

Borrower APR vs. Estimated Loss

Comment: Very nice correlation...let's see a fitted line

Comment: The smaller the loss the smaller the APR. We can speak here about a strong negative relationship....so what about the estimated Return?

BorrowerAPR vs. EstimatedReturn

Comment: Surprisingly the Return is as well in negative relationship to the APR. Why is that? It's the way how it is calculated. The estimated yield which is the difference between the estimated yield and the loss.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In general the occupation group "Other" is dominating the listings, whereas other occupations have a small proportion. This is a petty as it might be due to the fact that users have the possibility to choose from a ddlb the other category.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes the Income Range was underlining the relationship of high incomes to better rates.

Multivariate Exploration

Who is using Prosper?

Let's put together some demographical attributes and soem attributes realted to finacials.

Comment: Very interesting to see the demographical aspects together.

Why is Prosper used?

Let's revisit the TimeToMoney topic.

Comment: So the 90 days moving average indicates that there is a nearly constant around appr. 11days. Is there a change by the level of debts and rating...

Ok, pretty busy chart, let's zoom in

Comment: There is a wide range of days between the listing creation and the payout of the loan. There is nothing spectacular here, maybe we check in combination with the loan amount by occupation...

Comment: The values do not differ a lot by the top 10 occupations. As well as in comparison to the mean Time to Money (appr. 11days) and the mean Loan amount (appr. 9300$) over all occupations. This means the speed of the listing process is not significantly differing. If the Time to Money attribute is a factor for Borrowers to use Prosper can't be finally proofed as we would need to compare it with e.g. processing times of traditional banks.

What about the interest rates, let's check the rates, in particular the BorrowerAPR whcih includes all fees...

Comment: Loan Amount vs. the BorrowerAPR by Occupation is pretty much interesting. The mean APR and Loan is pretty much close to each other, however Computer Programmers get the best rates on average, 19,10%. Administrative Assistants get the worst rates, 25,59. Interesting is the fact that loan amounts e.g. 20k seem to have primarily rates better than the average of the respective occupation group. See above the red arrow annotations in the other group.

However if th rates are so good to attract the borowers would need to be analyzed with other data, e.g. data from bankloans for similar purposes and runtimes. For me the rates seem to be far high especially today (and compared to Germany).

This leads us to the question what influences most the BorrowerAPR?

The majority of borrowers are employed, which is evident because it wouldnt be easy to get a loan without a job. To dig further, i will investigate in the next part what are their occupations.

Comment: Most borrowers on Prosper indicate to be Professional, Computer Programmer, Administrative Assistant, Executive, Teacher, Analyst... All those people have chosen to borrow on Prosper instead of going to the conventional way and borrow from their commerciaL bank. This is could be due to an attractive interest rate offered for this categories of people. That what we will going to see later.

Borrower Rate and Average Credit Score by delinquency Status

Comment: Its clear now that for people who are delinquent (defaulted,past due, charged-off payments), banks applies more restricted credit conditions (higher interest rates). Also these people have lower credit scores than people with good status.

Evolution of BorrowerRate over years for Homeowners & Non-Homeowners

Comment: As we can see, the two rates have had the same evolution profile. As expected, rates for individuals with collateral(homeowner) are lower than those without collateral, but the gap between the two rates has decreased significantly since 2011.

Borrower APR (Annual cost)by Employment status and delinquency Status

Comment: This figure confirmes all what have been said already that applied borrower rate is higher for unemployed people and delinquent loans( loans with bad records).

What is influencing the rates?

Let's review some rating, income and loan attributes....

Comment:

Comment: We nicely that the Prosper Rating is a mighty evaluation metric. We can clearly see that the distribution of APR is getting worse from HR to AA. Furthermore we can see that the rates are increasing with higher indebtedness (at least slightly) and decreasing Prosper Rating. Same for the estimated loss it gets slightly higher by increasing indebtedness, but much more by the assignment to the respective Prosper Rating.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?